
PreTrendsPower: Roth (2022) paper review + silent-failure guard#463

Merged
igerber merged 22 commits into main from feature/roth-2022-paper-review
May 17, 2026

Conversation

@igerber
Owner

@igerber igerber commented May 17, 2026

Summary

  • New Roth (2022) paper review at docs/methodology/papers/roth-2022-review.md (298 lines) — methodology audit for the PreTrendsPower estimator, with registry-candidate block holistically split into ### Paper-derived requirements + ### Library design choices subsections and explicit boundary markers.
  • REGISTRY.md ## PreTrendsPower: 3 new **Note:** blocks documenting deviations surfaced by the audit — diag-Σ_22 fallback on CS/SA adapters, count-based linear weights (not Roth's δ_t = γ·t for irregular grids), and the new power_at(custom) silent-failure guard.
  • diff_diff/pretrends.py: PreTrendsPowerResults.power_at() now raises NotImplementedError for violation_type="custom" instead of silently substituting equal weights; unknown-violation_type else branch hardened to raise ValueError (mirrors _get_violation_weights).
  • METHODOLOGY_REVIEW.md: PreTrendsPower row records paper review on file (2026-05-17); promotion checklist updated.
  • TODO.md: 5 new rows under Methodology/Correctness tracking deferred follow-ups (Σ_22 fidelity, R-package revision pin, helper custom support, weight persistence, linear-units alignment).
  • New regression tests: test_power_at_raises_on_custom_violation_type + test_compute_pretrends_power_rejects_custom_violation_type + test_compute_mdv_rejects_custom_violation_type (93/93 passing across test_pretrends.py + test_pretrends_event_study.py).

Methodology references (required if estimator / math changes)

  • Method name(s): PreTrendsPower (paper-derived methodology audit); PreTrendsPowerResults.power_at() (silent-failure guard).
  • Paper / source link(s): Roth, J. (2022). Pretest with Caution: Event-Study Estimates after Testing for Parallel Trends. AER:I 4(3), 305-322. https://doi.org/10.1257/aeri.20210236
  • Any intentional deviations from the source (and why) — all documented inline in REGISTRY.md ## PreTrendsPower:
    1. CS/SA covariance source: shipped adapter uses diag(ses^2) rather than full pre-period Σ_22 for staggered result types (CS deliberate where event_study_vcov is available; SA forced because SunAbrahamResults does not expose an event-study VCV surface). Not provably conservative; tracked in TODO.md for the PR-B audit.
    2. linear violation pattern: _get_violation_weights("linear") constructs [n_pre-1, ..., 1, 0] from n_pre alone — does not honor actual relative-time labels, so reported MDV is not in Roth's γ units for irregular / anticipation-shifted grids. Tracked in TODO.md.
    3. Helper/class API split + power_at(custom) silent-failure guard: compute_pretrends_power / compute_mdv reject violation_type="custom" (no violation_weights parameter); the class-direct path works at fit time but PreTrendsPowerResults.power_at() now raises NotImplementedError for custom fits (was a silent equal-weights fallback at pretrends.py:230-231). Tracked in TODO.md for weight-persistence follow-up.
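
The fail-loud guard described in item 3 can be sketched as follows. This is a simplified, hypothetical rendering of the behavior the PR describes (raise `NotImplementedError` for `"custom"`, `ValueError` for unknown values), not the shipped `diff_diff/pretrends.py` implementation; the placeholder power computation is illustrative only.

```python
# Hypothetical, simplified sketch of the power_at() silent-failure guard.
# Names follow the PR text; the body is illustrative, not the shipped code.
from dataclasses import dataclass


@dataclass
class PreTrendsPowerResults:
    violation_type: str  # "linear", "constant", "last_period", or "custom"

    def power_at(self, M: float) -> float:
        if self.violation_type == "custom":
            # Fitted violation_weights are not persisted on the result object,
            # so the old equal-weights fallback silently answered the wrong
            # question. Fail loudly and point users to refitting instead.
            raise NotImplementedError(
                "power_at() cannot reconstruct custom violation weights; "
                "refit with the new M instead."
            )
        if self.violation_type in ("linear", "constant", "last_period"):
            return self._power_for_named_pattern(M)
        # Defensive branch: unknown violation_type values also fail loudly,
        # mirroring _get_violation_weights.
        raise ValueError(f"unknown violation_type: {self.violation_type!r}")

    def _power_for_named_pattern(self, M: float) -> float:
        return 0.0  # placeholder for the real power computation
```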

Validation

  • Tests added/updated: tests/test_pretrends.py — 3 new regression tests as listed above. Full test_pretrends.py + test_pretrends_event_study.py: 93/93 passing.
  • Backtest / simulation / notebook evidence: N/A — paper-review-scoped PR; no tutorial or notebook changes.

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

🤖 Generated with Claude Code

igerber and others added 21 commits May 17, 2026 07:43
PR-A of the 3-PR PreTrendsPower methodology review sequence (paper review
-> implementation audit -> R parity goldens). Produces the structured
methodology contract that will inform the PR-B audit of diff_diff/pretrends.py
against the paper's exact equations.

The review covers:
- Propositions 1-4 (bias decomposition, sign-of-bias under monotone trend,
  variance decomposition, convexity gives variance reduction)
- Power calculation algorithm for the slope gamma_p at target power p
- Bias and coverage calculation framework (analytical via tmvtnorm-equivalent
  truncated MVN moments; simulation fallback)
- Implementation notes and 11-item gaps list flagging items the PR-B audit
  will need to resolve (notably: NIS vs Wald acceptance region, violation-type
  extensions beyond linear, R-package version pin for parity)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
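
The power-calculation step covered by the review can be illustrated with a small sketch. This assumes independent pre-period estimates purely for brevity (Roth's Σ₂₂ is generally non-diagonal, and the paper's computation is a joint multivariate box probability); the function name and signature are invented for this example.

```python
# Illustrative sketch of NIS pretest power against a linear trend
# delta_t = gamma * t, under the SIMPLIFYING assumption of independent
# pre-period estimates (the paper's version uses the full Sigma_22).
import numpy as np
from scipy.stats import norm


def nis_power_linear(gamma, rel_times, ses, alpha=0.05):
    """Power of the no-individual-significance pretest vs delta_t = gamma * t."""
    z = norm.ppf(1 - alpha / 2)  # 1.96 at the paper's running alpha = 0.05
    means = gamma * np.asarray(rel_times, dtype=float)
    ses = np.asarray(ses, dtype=float)
    # P(each pre-period coefficient is individually insignificant),
    # i.e. |beta_hat_t / se_t| < z, with beta_hat_t ~ N(gamma * t, se_t^2)
    p_pass_each = norm.cdf(z - means / ses) - norm.cdf(-z - means / ses)
    return 1.0 - np.prod(p_pass_each)


# gamma = 0 recovers the size of the pretest: 1 - 0.95**K under independence
print(nis_power_linear(0.0, rel_times=[-3, -2, -1], ses=[0.1, 0.1, 0.1]))
```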
- P1 covariance block convention: add explicit "post-first" block decomposition
  paragraph defining Sigma_11=Var[beta_hat_post], Sigma_22=Var[beta_hat_pre],
  Sigma_12=Cov(beta_hat_post, beta_hat_pre). Roth's stacked vector is (pre, post)'
  but his proofs partition Var[(beta_hat_post, beta_hat_pre)'] post-first; without
  the explicit convention, Sigma_22 references in the propositions were ambiguous.
- P1 plug-in variance dimensional bug: change sigma^2_{tau_hat} = l' Sigma l to
  l' Sigma_11 l (since l in R^M and Sigma_11 = Var[beta_hat_post] is the only
  block of correct dimensions). Add explicit reminder pointing back to the
  convention.
- P1 missing Note/Deviation labels in copy-ready Registry section: inline
  **Note (deviation from paper):** for slope-of-best-fit acceptance region,
  constant/last_period/custom violation types, and the slope pretest_form
  enum value. Mark joint Wald + custom B(Sigma) as paper-supported framework
  (Propositions 1, 3, 4 apply) but not separately tabulated. Mark "individual"
  (NIS) as the only paper-analyzed form.
- P3 brittle line reference: replace "line 2758" with the heading reference
  "## PreTrendsPower stub".
- Algorithm step 2 cleanup: clarify sigma_t = sqrt(Sigma_22[t, t]) under the
  new convention.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
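
The post-first block convention and the dimensional fix above can be made concrete with a short sketch. Dimensions and the aggregation vector `l` are schematic assumptions for illustration, not values from the library.

```python
# Sketch of the post-first block decomposition fixed above: partition
# Var[(beta_hat_post, beta_hat_pre)'] so that Sigma_11 = Var[beta_hat_post],
# Sigma_22 = Var[beta_hat_pre], Sigma_12 = Cov(post, pre). The plug-in
# variance of tau_hat = l' beta_hat_post is l' Sigma_11 l — the only block
# whose dimensions match l in R^M. Schematic dimensions only.
import numpy as np

M, K = 2, 3  # M post-period coefficients, K pre-period coefficients
rng = np.random.default_rng(0)
A = rng.standard_normal((M + K, M + K))
Sigma = A @ A.T  # a valid (M+K) x (M+K) covariance, post-first ordering

Sigma_11 = Sigma[:M, :M]   # Var[beta_hat_post], M x M
Sigma_22 = Sigma[M:, M:]   # Var[beta_hat_pre], K x K
Sigma_12 = Sigma[:M, M:]   # Cov(beta_hat_post, beta_hat_pre), M x K

l = np.ones(M) / M              # example aggregation weights, l in R^M
var_tau_hat = l @ Sigma_11 @ l  # l' Sigma_11 l: dimensionally correct form
print(Sigma_11.shape, Sigma_22.shape, var_tau_hat > 0)
```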
- P2 API surface drift: add explicit Status disclaimer at top of "Methodology
  Registry Entry" section flagging the block as proposed-replacement-text-not-
  yet-merged (mirrors the established goodman-bacon-2021-review.md:14 pattern).
  Restructure the Tuning Parameters table to add a Status column distinguishing
  "Current" (matches diff_diff/pretrends.py:442-447) from "Proposed" (requires
  API extension via the PR-B audit), so the block is not mistaken for the
  shipped surface.
- P2 Assumption 1 design-heuristic overclaim: drop the "balanced panel +
  cluster-robust + equicorrelated" design-metadata heuristic. Roth's
  Assumption 1 is a condition on the estimated Sigma, not on the experimental
  design. Replace with the requirement that any sharper sign-of-bias warning
  must come from a direct numerical check of Sigma's structure.
- P3 dead internal reference: remove the parenthetical pointer to
  `feedback_verdict_powered_by_tools.md` (a Claude private-memory artifact,
  not a git-tracked file in this repo).
- P3 source-of-truth split: addressed by the Status disclaimer above, which
  explicitly names the existing PreTrendsPower REGISTRY stub as the sole
  authoritative methodology contract until PR-B lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P1 Proposition 4 convexity overclaim: R1/R2 fixes added language to the
Alternative-acceptance-regions block, requirements checklist, pretest_form
row, and acceptance_region row that said "Propositions 1, 3, 4 apply to any
B(Sigma)". That overstates Roth: Proposition 1 (conditional mean) and
Proposition 3 (conditional variance) hold for any measurable B, but
Proposition 4 (variance reduction / over-coverage under parallel trends)
begins "Suppose that B(Sigma) is a convex set." Nonconvex custom pretests
lose the variance-reduction guarantee even though the conditional-moment
formulas still hold.

Fixed:
- "Custom user-supplied B(Sigma)" bullet now distinguishes Props 1+3 (any B)
  from Prop 4 (requires convex B) and explicitly notes nonconvex pretests
  lose the over-coverage guarantee.
- Requirements checklist now reads "joint Wald (convex, Props 1+3+4 all
  apply); custom B (Props 1+3 apply to any measurable B, Prop 4 only if B is
  convex)".
- `pretest_form` row in tuning parameters: same convexity split.
- `acceptance_region` row: "Propositions 1, 3 apply to any measurable B;
  Proposition 4 / variance-reduction guarantee additionally requires B to
  be convex".

Joint Wald (line 47) already correctly conditioned on convexity ("apply to
this B since it is convex") and was left as-is.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 nonlinear-trend "deviation" overstatement: walk back the "**Note
  (deviation from paper):**" labels on nonlinear violation hypotheses
  (constant level shift, last-period jump, custom delta vector). Roth's
  Section III ("Practical Recommendations") explicitly endorses power
  analyses against hypothesized nonlinear trends through the `pretrends`
  package, so the GENERAL nonlinear-power capability is paper-supported;
  only the SPECIFIC named API parameterizations are R-package conventions.
  Updates: Requirements checklist line and Gaps "Nonlinear violations" +
  "Custom delta" bullets reframed to distinguish paper-supported framework
  (Section III + Propositions 1+3 applying to any trend) from package-API
  parameterizations.
- P1 absolute "UNDER-cover" claim: Roth's wording after Prop 4 is "CIs will
  tend to under-cover if the bias is sufficiently large" — the under-coverage
  direction is contingent on bias magnitude, not universal. Updated the
  Implication paragraph after Proposition 4 to keep the conditionality
  ("tend to UNDER-cover **if the bias is sufficiently large** ... not
  universal").
- P2 "stub" mischaracterization: the live `## PreTrendsPower` entry at
  REGISTRY.md:2758-2808 is a populated 50-line block (Wald-test framed),
  not a stub. Updated the Status header to describe it accurately as "a
  populated 50-line block framed primarily around a joint-Wald pre-trends
  test" and clarify that this file is a non-authoritative source audit.

Slope-of-best-fit "deviation from paper" labels (lines 48, 159, 195) kept
as-is: Roth does NOT analyze a slope-t-stat as an acceptance region in
Section I or Section III; the slope t-stat in Table 1 is only an observed
property of surveyed papers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 alpha/1.96 internal inconsistency: the proposed registry block exposed
  `alpha` as a parameter but hard-coded 1.96 in B_NIS, the plug-in CI, the
  power formula, the K=1 closed-form, and the coverage formulas. Promoted
  to REGISTRY this would document the wrong inference rule for any
  alpha != 0.05. Generalize all five formulas to z_{1-alpha/2}; keep "=1.96"
  parenthetical annotations to ground them at the paper's running default.
- P2 sample vs population identity: "beta_hat_pre, which equals delta_pre
  under no anticipation" muddied Roth's stochastic model. Roth Equations
  (1)-(2) keep beta_hat (random draw) and beta (mean) distinct; under no
  anticipation, beta_pre = delta_pre (population), not beta_hat_pre =
  delta_pre. Reworded to "beta_hat_pre, whose mean beta_pre equals
  delta_pre under no anticipation" with explicit "random draw is not itself
  equal to delta_pre" caveat.
- P2 "11 of 12" overcount: Section I.B of Roth (2022) does not support
  "11 of 12 use individual significance" as the count I wrote. The paper
  reports: 12 of 12 plot pointwise CIs (visual-NIS inspection possible),
  5 of 12 explicitly discuss individual significance, 1 of 12 reports a
  joint-significance test, several rely on visual inspection without a
  formal criterion. Restated using the paper's own breakdown and framed
  NIS as the implicit common denominator rather than a literal explicit-
  rule majority.
- P3 paper methodology vs library extension labeling: the Figure-1-style
  plotting and HonestDiD-composition checklist items inside the copy-ready
  block conflated paper-derived methodology with diff-diff design choices.
  Added "**Note (library extension):**" labels to both, distinguishing the
  paper-derived numerical content (bias and CI by gamma_p; methodology-
  agnostic on how beta_hat/Sigma_hat is produced) from the library
  presentation and cross-estimator composition decisions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
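
The alpha generalization above amounts to one standard substitution, sketched here; `critical_value` is an invented helper name for illustration.

```python
# The generalization described above: replace the hard-coded 1.96 with
# z_{1-alpha/2}, which recovers 1.96 at the paper's running alpha = 0.05.
from scipy.stats import norm


def critical_value(alpha: float) -> float:
    """Two-sided normal critical value z_{1-alpha/2}."""
    return norm.ppf(1 - alpha / 2)


print(round(critical_value(0.05), 2))  # 1.96
print(round(critical_value(0.10), 3))  # 1.645
```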
Per user direction (option A: holistic restructure) after three consecutive
rounds of "paper-derived vs library-extension boundary" P1s (R3 Prop 4
convexity, R4 nonlinear-trend "deviation", R6 analytical/simulation as
paper-required). The recurring class warranted naming the invariant and
fixing it structurally rather than another surface patch.

Holistic fix:
- Renamed the single Requirements-checklist block into two named
  subsections inside the copy-ready Registry entry:
    ### Paper-derived requirements (must satisfy to be faithful to Roth 2022)
    ### Library design choices (extensions beyond Roth 2022)
- Moved each non-paper item (simulation fallback, method/n_sim knobs,
  pretest_form/acceptance_region surface, named non-linear violation
  parameterizations, Figure-1 plotting interface, HonestDiD composition)
  into the Library Design Choices subsection with explicit framing that
  the library may keep, drop, or extend each item via the follow-up audit.
- Reworded the Tuning Parameters table's Status column to a clearer
  Source column ("Paper-derived" vs "Library extension") so every row's
  origin is unambiguous.

P2 (Table 2 factual error):
- Coverage bullet under "Empirical Findings" said unconditional gamma_{0.8}
  range was "53% to 98%". Actual Table 2 range for tau_bar unconditional
  under gamma_{0.8} is 14% (Deschenes et al. 2017) to 98% (Lafortune et al.;
  Markevich & Zhuravskaya). Fixed.

P3 (transient process / line refs):
- Dropped "PR-B in the 3-PR sequence" language from the Status banner and
  Tuning Parameters note.
- Replaced literal "diff_diff/pretrends.py:442-447" with stable symbol
  refs (`compute_pretrends_power`, `PreTrendsPowerResults`).
- Replaced "lines 2758-2808" REGISTRY range with a non-line-numbered
  description ("populated block framed primarily around a joint-Wald
  pre-trends test").

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 joint-Wald + tmvtnorm still in paper-derived checklist: the R6
  restructure narrowed the recurring "paper vs library" class but two
  items remained mislabeled — joint Wald (a paper-mentioned alternative
  used by 1 of 12 surveyed papers, not the empirical exercise) and the
  tmvtnorm-specific analytical backend (one of several valid numerical
  paths; the paper requires producing the right quantities, not using
  this specific solver). Moved both to "Library design choices
  (paper-supported alternatives and extensions)" with explicit framing
  that the paper requirement is at the QUANTITY level (conditional
  moments, power, bias, coverage), not at the API or numerical-backend
  level. Updated the subsection heading to "paper-supported alternatives
  and extensions beyond Roth 2022" to make the two categories explicit.
- P2 Proposition 4 summary row dropped "if bias is sufficiently large":
  the Key Theorems table row claimed CIs "UNDER-cover under violations"
  unconditionally, contradicting the conditional version stated correctly
  elsewhere in the file. Rewrote the row to limit the variance-reduction
  guarantee to its strict statement and pushed the CI coverage discussion
  into a clearly-conditioned addendum.
- P2 API misattribution: I had said `compute_pretrends_power` exposes
  `alpha, power, violation_type, violation_weights` (which is actually
  the `PreTrendsPower` class signature). The helper's actual signature
  is `compute_pretrends_power(results, M, alpha, target_power,
  violation_type, pre_periods)`. Updated the Status banner and Tuning
  Parameters note to name both surfaces explicitly with their actual
  signatures, and clarified that the follow-up audit will reconcile
  them with each other and with the proposed contract.
- Related P2-adjacent overclaim: line 217 said the library "implements
  a Roth-2022 framework"; line 270 said it "implements computations
  under arbitrary Sigma via Proposition 1". Both overstate what
  `pretrends.py` currently does. Tightened to "Wald-test-based pre-trends
  MDV/power workflow ... computes Wald power/MDV from the pre-period VCV
  block rather than the full arbitrary-Sigma Proposition 1/3/4
  conditional-moment computations", and reframed the heteroskedastic-
  Sigma gap as a future capability conditional on the Prop 1/3/4
  path being added.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R8 verdict was ✅ Looks good (no P0/P1) — addressing remaining polish items.

- P2 internal contradiction on simulation-fallback requirement: the
  "Computational shortcut (footnote 8)" paragraph said the library "should
  support both an analytical truncated-multivariate-normal path AND a
  simulation fallback", contradicting the Library design choices subsection
  which correctly classifies simulation fallback as a library robustness
  choice. Rewrote the paragraph to state the paper-derived requirement at
  the quantity level (compute the correct conditional moments and
  probabilities) and explicitly defer backend choice to the Library design
  choices subsection.
- P2 joint-Wald R-parity overclaim: the Gaps section said "library should
  implement both but test against R `pretrends` for the joint-Wald case
  (Roth's package supports it)." Verified via R-package source/docs that
  the published `pretrends` package exposes only NIS-based surfaces
  (`pretrends()`, `slope_for_power()`, `*_NIS` helpers) — no joint-Wald
  parity target. Rewrote to make explicit that joint Wald is theoretically
  admissible under the propositions but parity would need an independent
  fixture/derivation, not direct R-package parity.
- P3 composition overclaim: the Relation section listed
  `TwoWayFixedEffects` and "any estimator producing an event-study
  coefficient vector" as currently composable, but the shipped
  `compute_pretrends_power` adapter only dispatches `MultiPeriodDiDResults`,
  `CallawaySantAnnaResults`, `SunAbrahamResults` and raises `TypeError`
  otherwise. Split into "Currently composes with" (the three shipped
  adapter targets) and "theoretical compatibility" (extends to TWFE etc.,
  pending follow-up audit adapters).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
- P1 CS/SA covariance-source deviation: the R7 edit said current
  `pretrends.py` "computes Wald power/MDV from the pre-period variance-
  covariance block", but only the `MultiPeriodDiDResults` branch extracts
  the full pre-period sub-VCV (pretrends.py:592-601). The CS branch
  (pretrends.py:609-652) and SA branch (pretrends.py:660-687) hard-code
  `vcov = np.diag(ses^2)`, dropping the pre-period correlations the
  paper's Propositions rely on. Added an explicit "Note (deviation in
  current covariance-source)" bullet in the "Relation to Existing
  diff-diff Estimators" section documenting the per-result-type split and
  cross-referencing the conservative-variance framing in REPORTING.md.
  The follow-up audit will decide whether to extend full-sub-VCV
  extraction to CS/SA or keep the diag fallback as a deliberate variance-
  conservatism choice.
- P1 Assumption 1 model-vs-estimated Sigma: the R2 edit said
  "Assumption 1 is a condition on the *estimated covariance matrix*
  Sigma". Roth's Section II.B states Assumption 1 on the *model*
  covariance Sigma (the population VCV in Equation 1). Software can only
  inspect the *estimated* Sigma_hat, so any direct numerical check is a
  HEURISTIC implementation proxy, not the paper's assumption itself.
  Rewrote the Prop 2 Implementation-use cell to (a) name Sigma as the
  model covariance, (b) acknowledge Sigma_hat as the implementation
  surrogate, (c) require any sharper warning based on Sigma_hat be
  labeled heuristic.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
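
The covariance-source split documented above can be contrasted in a few lines. Function names and the toy matrix are schematic, not the shipped adapter code in `diff_diff/pretrends.py`.

```python
# Illustrative contrast between the two covariance sources described above:
# full pre-period sub-VCV extraction (MultiPeriodDiD branch) vs the
# diag(ses^2) fallback (CS/SA branches), which drops the pre-period
# correlations Roth's Propositions rely on and is not guaranteed conservative.
import numpy as np


def pre_period_vcov_full(vcov: np.ndarray, pre_idx: list[int]) -> np.ndarray:
    """Extract the full pre-period block Sigma_22, keeping off-diagonals."""
    idx = np.asarray(pre_idx)
    return vcov[np.ix_(idx, idx)]


def pre_period_vcov_diag(ses: np.ndarray) -> np.ndarray:
    """diag(ses^2) fallback: off-diagonals zeroed."""
    return np.diag(np.asarray(ses) ** 2)


vcov = np.array([[0.04, 0.01, 0.00],
                 [0.01, 0.04, 0.01],
                 [0.00, 0.01, 0.09]])
full = pre_period_vcov_full(vcov, [0, 1])
diag = pre_period_vcov_diag(np.sqrt(np.diag(vcov))[:2])
print(full)  # off-diagonal 0.01 retained
print(diag)  # off-diagonal zeroed; diagonals agree with `full`
```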
- P1 "conservative variance source" framing for diag fallback is wrong:
  R9 fix added a Note saying the CS/SA `diag(ses^2)` fallback could be
  retained as a "deliberate variance-conservatism choice". Roth's
  Propositions 1+3+4 (and the MDV/power computation w' Σ_22^{-1} w)
  operate on the full pre-period covariance block Σ_22 — dropping the
  off-diagonals is a non-paper approximation, not a paper-supported
  numerical option, and the direction of the discrepancy with the full-
  Σ_22 calc depends on the sign and magnitude of the dropped
  correlations (not guaranteed conservative). Reworded the Note to
  describe `diag(ses^2)` explicitly as a non-paper approximation; if the
  follow-up audit retains it, an explicit REGISTRY.md Note describing
  the approximation + its possible miscalibration is required instead.
- P3 R package version not pinned: R `pretrends` package API claims in
  the Joint-Wald and version-pin Gaps bullets were temporally unstable.
  Softened the Joint-Wald observation to "as observed at the time of
  this review (specific commit not pinned)" with an explicit ask for the
  follow-up audit to record the exact revision. Strengthened the
  version-pin Gaps bullet to enumerate exactly which surface claims
  depend on the unpinned R-package surface (NIS-only, no joint-Wald,
  `pretrends()` / `slope_for_power()` / `*_NIS` helpers).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R11 verdict was ✅ Looks good (second consecutive) — addressing remaining
polish items.

- P2 CS/SA covariance-source distinction: R9/R10 Note lumped CS and SA
  together as "even when an event-study VCV is available". That is true
  for CallawaySantAnnaResults (which persists event_study_vcov at
  staggered_results.py:126-128) but NOT for SunAbrahamResults (which
  does not expose an event-study or cohort covariance surface; verified
  in sun_abraham.py:30-88). Split the Note into per-result-type bullets:
  CS diag fallback is a deliberate choice (off-diagonals available);
  SA diag fallback is forced (off-diagonals not persisted on the result
  object, so Roth-faithful off-diagonal support requires upstream surface
  work first).
- P3 unsourced K<=5 / K>10 cutoffs: the "For K <= 5, analytical methods
  are fast. For K > 10, simulation is preferable" guidance in
  Computational Considerations was a hunch, not benchmarked or
  paper-derived. Relabeled as "tentative heuristic (not benchmarked in
  this review and not specified by the paper)" with explicit ask for
  the follow-up audit to benchmark or replace.
- P3 R-package version pin still implicit: R10 fix softened the Joint
  Wald observation to "as observed at the time of this review" but the
  reviewer wanted a more explicit catch-all. Strengthened the Gaps
  bullet to enumerate every R-package surface claim that needs
  re-verification at a pinned revision (NIS-based surface, no joint-
  Wald target, named helpers, nonlinear-trend support claim) and label
  them collectively provisional.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R12 verdict was ✅ Looks good (third consecutive) — addressing remaining
polish items.

- P2 "uninformative pretests are unambiguously harmful" overstates II.D:
  Roth's Section II.D + Equation (4) say the net publication-screening
  effect is ambiguous, with underpowered pretests being "least effective
  and potentially harmful" — not unambiguously harmful in every
  parameterization. Internally inconsistent with my own L266 summary
  ("net sign depends on which dominates"). Reworded the Publication-bias
  edge-case bullet to match the paper's own framing.
- P3 "adding more pre-periods does NOT help (and can hurt)" overstates
  Section I.D: paper only says additional distant pre-periods may not
  help / may be uninformative for near-treatment shocks. Softened to
  "may not help / may be uninformative" without the active-harm claim.
- P3 gamma_max = "largest |gamma| at which power = 1" is impossible:
  under the normal model power approaches 1 only asymptotically, so
  no finite gamma_max satisfies power(gamma_max) = 1 exactly. Replaced
  with a doubling-expansion + bisection bracketing strategy.
- P3 audit-notes vs registry-candidate boundary unclear: added an
  explicit HTML-comment boundary marker just before "## Implementation
  Notes", retitled that heading to "(audit notes — NOT registry-
  candidate)", and updated the Status banner to call out the boundary
  explicitly so tentative heuristics + provisional R-package claims
  cannot be misread as part of the proposed REGISTRY replacement text.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
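
The doubling-expansion + bisection strategy above can be sketched generically. The helper name is invented, and the sketch assumes `power` is monotone increasing in gamma (which the bracketing argument requires); the toy power curve is for illustration only.

```python
# Sketch of the doubling-expansion + bisection bracketing described above:
# since power -> 1 only asymptotically, search for the smallest gamma with
# power(gamma) >= target rather than for power(gamma) == 1 exactly.
import math


def gamma_for_target_power(power, target=0.80, tol=1e-8):
    # Doubling expansion: grow the bracket until power(hi) clears the target.
    lo, hi = 0.0, 1.0
    while power(hi) < target:
        lo, hi = hi, 2.0 * hi
    # Bisection inside the bracket [lo, hi].
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if power(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


# Toy monotone power curve for illustration only.
toy_power = lambda g: 1.0 - math.exp(-g)
g80 = gamma_for_target_power(toy_power, target=0.80)
print(round(g80, 4))  # ~= -ln(0.2) ~= 1.6094
```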
…ates

R13 verdict was ✅ Looks good (4th consecutive) — addressing remaining
polish items.

- P2 Wald vs NIS power-object conflation: the CS/SA covariance-source
  Note said "the key MDV/power quantity is w' Σ_22^{-1} w". That scalar
  is the current library's Wald / noncentral-χ² object, not Roth's
  paper-analyzed NIS power object. Roth's NIS power is the multivariate
  box probability P(β̂_pre ∈ B_NIS(Σ)) computed under
  β̂_pre ~ N(δ_pre, Σ_22), and Proposition 1's pretest-bias term + Proposition
  3's conditional-variance term rely on truncated-MVN moments. Reworded
  the Note to separate the two surfaces: Roth's NIS box probability and
  the current library's Wald form are BOTH non-conservative under the
  diag fallback, but for different reasons, and the audit needs to pick
  one of them as the surface to promote.
- P3 scipy.stats over-restriction: replaced "scipy.stats has only the
  univariate case" with the accurate distinction —
  `scipy.stats.multivariate_normal.cdf` does cover box probabilities,
  but SciPy lacks a `tmvtnorm`-equivalent for the truncated-MVN MOMENTS
  that Propositions 1 + 3 need.
- P3 discoverability links (sibling updates, follows Bacon precedent at
  REGISTRY.md:2611):
  * REGISTRY.md `## PreTrendsPower` entry now includes
    `Paper review on file: docs/methodology/papers/roth-2022-review.md`
    inline with the primary source citation, with explicit note that
    REGISTRY remains authoritative until the follow-up audit lands.
  * METHODOLOGY_REVIEW.md PreTrendsPower section's "Documentation in
    place" list now records the paper review with the 2026-05-17 date;
    "Outstanding for promotion" has the paper-review bullet removed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
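
The multivariate box probability distinguished above can be sketched via inclusion-exclusion over box corners (2-D case for brevity; the function name and toy inputs are invented for this example).

```python
# Sketch of the NIS box probability described above:
# P(beta_hat_pre in a box) with beta_hat_pre ~ N(delta_pre, Sigma_22),
# computed by inclusion-exclusion over corners using
# scipy.stats.multivariate_normal.cdf (which covers box probabilities,
# unlike the truncated-MVN *moments* that Propositions 1 + 3 need).
import numpy as np
from scipy.stats import multivariate_normal


def box_probability_2d(mean, cov, lower, upper):
    """P(lower_i < X_i < upper_i for i = 1, 2) for X ~ N(mean, cov)."""
    F = lambda x1, x2: multivariate_normal.cdf([x1, x2], mean=mean, cov=cov)
    return (F(upper[0], upper[1]) - F(lower[0], upper[1])
            - F(upper[0], lower[1]) + F(lower[0], lower[1]))


# Independent standard normals: P(|X_i| < 1.96)^2 ~= 0.95**2 ~= 0.9025
p = box_probability_2d([0.0, 0.0], np.eye(2), [-1.96, -1.96], [1.96, 1.96])
print(p)
```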
R13's REGISTRY pointer pulled the new audit's diag-fallback finding into
authoritative scope; R14 flagged that the authoritative `## PreTrendsPower`
REGISTRY entry didn't yet document the deviation. Adding the inline
documentation per the CLAUDE.md "Documenting Deviations" convention.

- P1 REGISTRY missing `**Note (deviation from paper):**` for diag-VCV
  fallback: added a Note inside the `## PreTrendsPower` Edge cases block
  documenting all three result-type paths (MultiPeriodDiD: full sub-VCV
  when interaction_indices populated, diag otherwise; CS: diag fallback
  even with event_study_vcov available — deliberate; SA: diag fallback
  forced — SunAbrahamResults has no event-study VCV surface). The Note
  cross-references the paper review file and the new TODO entries.
- P2 `compute_pretrends_power(..., violation_type="custom")` is broken
  today: added an explicit "**Helper/class API gap observed today**"
  call-out in the Status block of the paper review, naming the missing
  `violation_weights` parameter on the helper and clarifying that
  `"custom"` is class-only today.
- P3 R-package version pin + diag-fallback follow-up: added three TODO.md
  rows under "Methodology/Correctness":
  * Diag fallback follow-up (Medium) — route CS through event_study_vcov +
    extend SunAbrahamResults with event-study/cohort VCV, or formally
    retain diag with miscalibration framing.
  * R pretrends commit pin (Low) — record the audited revision before
    building parity fixture.
  * compute_pretrends_power "custom" helper gap (Low) — add
    violation_weights to helper signature or document helper as supporting
    only linear/constant/last_period.

All three TODO rows tagged "PR-A (Roth paper review, 2026-05-17)" for
provenance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R15 verdict was ✅ Looks good (5th overall ✅) — addressing remaining polish.

- P3 CS event_study_vcov availability stated too broadly: previous wording
  in REGISTRY.md, roth-2022-review.md, and TODO.md said
  CallawaySantAnnaResults "persists event_study_vcov", but bootstrap CS
  fits explicitly clear that matrix at staggered.py:2032-2036 (to prevent
  mixing analytical VCV with bootstrap SEs). Qualified all three surfaces
  to split non-bootstrap (persists event_study_vcov, diag fallback is a
  deliberate choice) vs bootstrap (clears event_study_vcov, diag fallback
  is the only path) CS paths.
- P3 "four power calculations" parity target underspecified: the
  PreTrendsPower row in METHODOLOGY_REVIEW.md said future R parity should
  cover "the four power calculations" without acknowledging that
  `compute_pretrends_power(..., violation_type="custom")` is unusable from
  the helper today (helper does not accept violation_weights). Added
  clarification that "custom" parity has to run through PreTrendsPower
  directly until the helper is extended; helper-only parity is limited to
  linear/constant/last_period.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R16 verdict was ✅ Looks good (6th overall ✅), with all findings classified
P3-informational. Two of the four findings ("None in this PR" / "TODO rows
are correct") required no action; addressing the two with concrete polish
suggestions.

- P3 helper docstring: `compute_pretrends_power` and `compute_mdv` both
  accept `violation_type` but don't accept `violation_weights`, so
  `violation_type="custom"` is unusable from either helper today. Added
  an explicit Note in the `violation_type` docstring entry of both
  convenience functions pointing users to instantiate `PreTrendsPower`
  directly for custom weights, and cross-referencing the TODO row that
  tracks the helper-extension follow-up.
- P3 METHODOLOGY_REVIEW.md "four power calculations" target: now reads
  "at a pinned revision" with cross-reference to the TODO row that
  tracks the R-package revision pin. Until that lands, the R-package
  surface claims in the paper review remain provisional.

Pyright diagnostics in pretrends.py (matplotlib import, numpy/list type
conflicts, ndarray-to-tuple coercion) are pre-existing in code I did not
touch; my edits are docstring-only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R17 escalated to ⛔ Blocker after the agentic codex reviewer dug into
pretrends.py and surfaced two pre-existing library bugs that my R16
helper docstring update inadvertently directed users toward:

- P0 verified: PreTrendsPowerResults.power_at() silently wrong for
  violation_type="custom". The result dataclass does not persist fitted
  violation_weights (pretrends.py:77-90), and power_at() falls back to
  equal weights with an inline "we can't reconstruct" comment at line
  230-231. My R16 docstring change pointed users to the broken class
  path.
- P1 verified: linear violation pattern at _get_violation_weights()
  (pretrends.py:510-515) ignores actual pre-period relative-time labels
  and constructs a shifted [n-1,...,0] direction from n_pre count alone,
  so for irregular / anticipation-shifted grids the reported MDV is
  NOT in Roth's γ units.

Per user direction, keeping PR-A paper-review-scoped and deferring both
bugs to PR-B (the implementation audit):
- Added two High-priority TODO.md rows tagged "PR-A (surfaced by R17 of
  the iterative codex review)" with explicit line refs + concrete fix
  outlines (persist normalized fitted weights on PreTrendsPowerResults
  for the P0; thread relative-time labels through _get_violation_weights
  for the P1).
- Reverted the R16 helper-docstring recommendation that pointed custom
  users to instantiate PreTrendsPower(..., violation_weights=...). The
  helper now describes the helper-vs-class API gap AND the
  PreTrendsPowerResults.power_at() known issue, without recommending
  the broken class path. Both gaps cross-referenced to TODO.md.

Pyright diagnostics in pretrends.py (matplotlib import, numpy/list type
conflicts, etc.) are pre-existing in code I did not touch; my edits are
docstring + TODO.md only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ar-deviation Note

R18 rejected the R17 TODO-only deferral: "a TODO/docstring note does not
mitigate a silent statistical bug." Per user direction, applying the
minimal partial fix that codex's verdict explicitly offered as acceptable.

- P0 silent-failure guard on PreTrendsPowerResults.power_at: rather than
  doing the full weight-persistence refactor (deferred to PR-B), the
  method now raises NotImplementedError for violation_type="custom" and
  points users to refit with the new M instead. The previous equal-weights
  fallback (with the literal "we can't reconstruct" comment) is replaced
  with a fail-loud guard at the top of power_at; the inner else branch is
  retained as a defensive fallback for future violation_type values.
  Added a Raises section to the docstring documenting the guard. Added
  a regression test
  `test_power_at_raises_on_custom_violation_type` to lock the behavior in;
  full test_pretrends.py suite passes 64/64.
- P1 REGISTRY linear-violation deviation Note: added a `**Note (deviation
  from paper — linear violation pattern):**` block inside the
  `## PreTrendsPower` entry's Violation-types block describing how the
  shipped `[n_pre-1, ..., 1, 0]` direction is built from `n_pre` count
  alone (no relative-time labels), so irregular / anticipation-shifted
  grids do not yield Roth's γ-units MDV. Cross-referenced to TODO.md.
- Silent-failure-guard Note: also added a paired Note documenting the
  power_at custom raise behavior in REGISTRY.md for symmetry with the
  linear deviation Note (both label the corresponding TODO follow-up).
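
As a rough sketch of the fail-loud guard pattern described above (the class, the placeholder power curve, and the messages are simplified stand-ins, not the library's actual implementation):

```python
import math

class ResultsSketch:
    """Simplified stand-in for PreTrendsPowerResults; illustration only."""

    def __init__(self, violation_type: str):
        self.violation_type = violation_type

    def power_at(self, M: float) -> float:
        # Fail-loud guard at the top of the method: fitted custom weights
        # are not persisted on the result object, so they cannot be
        # reconstructed here; raise rather than silently substitute
        # equal weights.
        if self.violation_type == "custom":
            raise NotImplementedError(
                "power_at() cannot reconstruct custom violation_weights; "
                "refit the estimator with the new M instead."
            )
        # Placeholder curve standing in for the real power calculation.
        return 1.0 - math.exp(-M)

print(round(ResultsSketch("linear").power_at(1.0), 3))  # 0.632
try:
    ResultsSketch("custom").power_at(1.0)
except NotImplementedError:
    print("custom fit fails loudly")
```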

Pre-existing pyright diagnostics in pretrends.py are unrelated to this
diff (my edits are confined to power_at, docstrings, REGISTRY, and a
new test method).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R19 verdict was ✅ Looks good — addressing all 3 cleanup items so the
documentation, tracking, and tests align with R18's actual mitigation.

- P2 helper docstrings stale: the `compute_pretrends_power` and
  `compute_mdv` docstrings still said `power_at()` "silently falls back
  to equal weights" even though R18 changed the behavior to raise
  `NotImplementedError`. Updated both docstrings to describe the actual
  current behavior: the helper raises `ValueError` from the underlying
  constructor for `violation_type='custom'`, and `PreTrendsPowerResults.
  power_at()` on a custom class fit raises `NotImplementedError` because
  fitted weights are not yet persisted.
- P3 TODO.md row stale wording: `TODO.md:100` row still described the
  silent-failure path in present tense. Rewrote to record that the
  silent-failure path was mitigated in R18 (locked by
  test_power_at_raises_on_custom_violation_type) and narrowed the
  remaining follow-up to the weight-persistence work needed to re-enable
  `power_at()` for custom fits. Downgraded priority from High to Medium
  since the silent-bug surface is closed.
- P3 helper-level regression test: added two new tests to
  `TestConvenienceFunctions`:
  * `test_compute_pretrends_power_rejects_custom_violation_type` —
    confirms ValueError when `compute_pretrends_power(...,
    violation_type='custom')` is called without weights.
  * `test_compute_mdv_rejects_custom_violation_type` — mirrors the
    contract for `compute_mdv`.
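
The contract those two tests lock in can be sketched with a stub in place of the real helper (the stub's name, signature, and error message are assumptions for illustration; the actual tests call the library helpers):

```python
def compute_pretrends_power_stub(violation_type: str = "linear"):
    """Stub standing in for the convenience helper: it exposes no
    violation_weights parameter, so a 'custom' request cannot be
    satisfied and the underlying constructor's ValueError propagates."""
    if violation_type == "custom":
        raise ValueError(
            "violation_type='custom' requires violation_weights"
        )
    return "power-result"

def test_rejects_custom_violation_type():
    # Mirrors the shape of the new helper-level regression tests.
    try:
        compute_pretrends_power_stub(violation_type="custom")
    except ValueError as exc:
        assert "custom" in str(exc)
    else:
        raise AssertionError("expected ValueError for violation_type='custom'")

test_rejects_custom_violation_type()
print("contract test passed")
```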

Full test_pretrends.py suite: 66/66 passing (64 existing + 2 new helper tests).
Pre-existing pyright diagnostics in pretrends.py + test_pretrends.py are
unrelated to this diff.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…allback

R20 verdict was ✅ Looks good. The two methodology + tech-debt
P3-informational findings explicitly stated "none required for approval"
(both already tracked via REGISTRY notes + TODO rows). The one actionable
finding was addressed with a concrete fix:

- P3 maintainability — PreTrendsPowerResults.power_at() still had an
  unknown-violation_type else fallback that defaulted to equal weights
  (np.ones(n_pre)). A future violation_type added to fit() but not
  threaded through power_at() would therefore silently have produced
  equal-weights output. Replaced the fallback with an explicit ValueError
  mirroring the raise at the end of _get_violation_weights() (lines
  530-531). This aligns the two weight-reconstruction paths and closes
  the future-drift silent surface.
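
A minimal sketch of the hardened branch (the helper name is mine, and the violation-type strings other than "linear" and "custom" are assumptions; only the shape of the before/after behavior is taken from this thread):

```python
import numpy as np

def reconstruct_weights_sketch(violation_type: str, n_pre: int) -> np.ndarray:
    """Illustration of the hardened else branch, not the library's code."""
    if violation_type == "linear":
        return np.arange(n_pre - 1, -1, -1, dtype=float)
    if violation_type == "equal":  # assumed built-in type, for illustration
        return np.ones(n_pre)
    # Before the fix this branch was `return np.ones(n_pre)`, so an
    # unrecognized type silently produced equal-weights output. After the
    # fix it fails loudly, mirroring _get_violation_weights().
    raise ValueError(f"Unknown violation_type: {violation_type!r}")

print(reconstruct_weights_sketch("linear", 3))  # [2. 1. 0.]
try:
    reconstruct_weights_sketch("quadratic", 3)
except ValueError:
    print("unknown type fails loudly")
```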

Test suite: 66/66 passing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

Executive Summary

Methodology

  • Severity P3-informational. Impact: The PR removes a silent wrong-answer path in PreTrendsPowerResults.power_at() for custom fits and replaces it with a documented loud failure. This matches the new REGISTRY.md note, so it is a mitigation, not a new methodology defect. Concrete fix: none required for approval; the tracked follow-up is to persist fitted violation_weights on the results object and re-enable power_at() for custom fits. diff_diff/pretrends.py:L196-L259, docs/methodology/REGISTRY.md:L2798-L2798, TODO.md:L100-L100
  • Severity P3-informational. Impact: The PR correctly records two existing deviations from Roth (2022): the count-based linear direction is not in Roth’s γ units on irregular grids, and some adapters still use diagonal Σ_22 fallback. Because both deviations are explicitly documented in REGISTRY.md, they are not blockers under the review policy. Concrete fix: none required for approval; the follow-up work is already tracked. docs/methodology/REGISTRY.md:L2796-L2814, TODO.md:L97-L101

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3-informational. Impact: PreTrendsPowerResults.power_at() still re-implements weight reconstruction instead of consuming persisted fitted weights, so future violation-type additions still require edits in more than one place. The new ValueError prevents silent corruption, but the drift risk remains. Concrete fix: persist normalized fitted weights on PreTrendsPowerResults and compute power_at() from the stored vector. This is already tracked in TODO.md. diff_diff/pretrends.py:L239-L264, diff_diff/pretrends.py:L516-L559, TODO.md:L100-L100

Tech Debt

  • No new untracked tech debt. The remaining PreTrendsPower follow-ups are explicitly captured in TODO.md:L97-L101.

Security

  • No findings.

Documentation/Tests

  • Severity P3. Impact: REPORTING.md still describes the diagonal-covariance fallback as conservative, but the new authoritative registry/TODO language says dropping off-diagonals is not provably conservative. That leaves two methodology docs disagreeing about the same behavior. Concrete fix: align docs/methodology/REPORTING.md with the new REGISTRY.md note and remove the conservative framing unless it can be proven. docs/methodology/REPORTING.md:L324-L337, docs/methodology/REGISTRY.md:L2809-L2814
  • Severity P3. Impact: The new paper-review file says the reviewed PDF was papers/roth-2022.pdf, but there is no such PDF in the repo tree, so the audit is not fully reproducible from the repository alone. Concrete fix: add the PDF at the documented path, update the path to the real local location, or state that the PDF was reviewed externally and keep only the DOI. docs/methodology/papers/roth-2022-review.md:L3-L7
  • The added regression coverage is appropriate for the changed code paths in tests/test_pretrends.py:L527-L609. I could not execute the tests in this environment because pytest is not installed.

CI codex Round 1 verdict was ✅ Looks good. Addressing the two actionable
P3 doc findings; the other 4 P3 findings were explicit "no fix required"
mitigation confirmations.

- P3 REPORTING.md parity: the diagonal-covariance-fallback Note still
  called the deviation "conservative" at REPORTING.md:324-337, but the
  new authoritative REGISTRY.md note (in this PR) describes it as a
  non-paper approximation that is NOT provably conservative (direction
  of the discrepancy with full Σ_22 depends on sign + magnitude of
  dropped correlations). Aligned the REPORTING note to match the
  REGISTRY language, kept the description of the BusinessReport
  well_powered → moderately_powered downgrade as a practical safeguard
  (not a proof of conservatism), and cross-referenced the REGISTRY
  authoritative deviation block.
- P3 paper-review reproducibility: the roth-2022-review.md header
  listed "PDF reviewed: papers/roth-2022.pdf" but the /papers/ directory
  is gitignored, so the path isn't reproducible from the repo alone.
  Replaced the path with the published-article citation + DOI + an
  explicit note that the PDF was reviewed externally and that the
  /papers/ working directory is gitignored. Added pointers to the
  jonathandroth.com author page for reproduction.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: b3f818100a13bc48d96044ac4bd69a8027fce4aa


Overall Assessment

✅ Looks good

Executive Summary

  • The only runtime behavior change in scope is PreTrendsPowerResults.power_at() for violation_type="custom", and it replaces a silent equal-weights fallback with a loud NotImplementedError. diff_diff/pretrends.py:L196-L279
  • That change matches the authoritative methodology note now added to REGISTRY.md, so I do not see an unmitigated P0/P1 in the changed estimator path. docs/methodology/REGISTRY.md:L2796-L2814
  • The helper/class split for custom violation patterns is now explicit in both docstrings and regression tests, rather than being an implicit failure mode. diff_diff/pretrends.py:L1076-L1183, tests/test_pretrends.py:L527-L609
  • The prior doc drift from the last review is resolved: REPORTING.md no longer calls the diagonal-VCV fallback conservative, and the new paper review clarifies that the reviewed PDF was external rather than expected in-repo. docs/methodology/REPORTING.md:L324-L344, docs/methodology/papers/roth-2022-review.md:L1-L15

Methodology

  • Severity P3-informational. Impact: PreTrendsPowerResults.power_at() now fails loudly for custom fits because the result object does not persist fitted violation_weights, eliminating the prior silent wrong-output path. This behavior is explicitly documented in the authoritative registry, so it is mitigated rather than a new methodology defect. Concrete fix: none required for approval; the follow-up to persist fitted weights is already tracked. diff_diff/pretrends.py:L196-L279, docs/methodology/REGISTRY.md:L2798-L2814, TODO.md:L100-L100
  • Severity P3-informational. Impact: The remaining Roth-2022 mismatches surfaced by the audit, count-based linear weights on irregular grids and diagonal Σ_22 fallback on CS/SA adapters, are now documented with note labels in REGISTRY.md and tracked in TODO.md. Under the stated rubric, that makes them mitigated, not P1 methodology defects. Concrete fix: none required for approval; follow-ups already tracked. docs/methodology/REGISTRY.md:L2796-L2814, TODO.md:L97-L101

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • Severity P3-informational. Impact: PreTrendsPowerResults still reconstructs built-in weight vectors from violation_type and cannot reuse fitted custom weights, so results.power_at(M) remains unavailable for custom fits. The PR correctly converts that from silent corruption into explicit API behavior, but the duplication remains. Concrete fix: persist normalized fitted violation_weights on the results object and have power_at() consume the stored vector. diff_diff/pretrends.py:L77-L90, diff_diff/pretrends.py:L196-L279, TODO.md:L100-L100

Tech Debt

  • Severity P3-informational. Impact: The PR properly records the remaining follow-ups in TODO.md rather than leaving them implicit: Σ_22 fidelity, pinned R revision, helper custom support, custom-weight persistence, and linear-units alignment. There is no new untracked tech debt in scope. Concrete fix: none required for approval. TODO.md:L97-L101

Security

  • No findings.

Documentation/Tests

  • No findings. The changed docs now agree with each other on the diagonal-VCV fallback, and the paper-review file now makes its source provenance explicit. The new regression coverage targets all newly guarded custom entry points. docs/methodology/REPORTING.md:L324-L344, docs/methodology/papers/roth-2022-review.md:L1-L15, tests/test_pretrends.py:L527-L609
  • I could not execute the tests in this environment because pytest is not installed.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 17, 2026
@igerber igerber merged commit f6499db into main May 17, 2026
33 of 34 checks passed
@igerber igerber deleted the feature/roth-2022-paper-review branch May 17, 2026 16:55